122 research outputs found

    Identification of SNP interactions using logic regression

    Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. Many approaches based on classification methods such as CART and Random Forests allow the importance of single variables to be measured, but none of them can directly quantify the importance of combinations of variables. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose two measures for quantifying the importance of these interactions for classification. These approaches are then applied both to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Keywords: Single Nucleotide Polymorphism, Feature Selection, Variable Importance Measure, GENICA

    Imputing missing genotypes with weighted k nearest neighbors

    Motivation: Missing values are a common problem in genetic association studies concerned with single nucleotide polymorphisms (SNPs). Since most statistical methods cannot handle missing values, they have to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are needed that can be used to replace such missing values. In this article, we propose a method based on weighted k nearest neighbors that can be employed for imputing such missing genotypes. Results: In a comparison to other imputation approaches, our procedure called KNNcatImpute shows the lowest rates of falsely imputed genotypes when applied to the SNP data from the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Moreover, in contrast to other imputation methods that take all variables into account when replacing missing values of a particular variable, KNNcatImpute is not restricted to association studies comprising several tens to a few hundred SNPs, but can also be applied to data from whole-genome studies, as an application to a subset of the HapMap data shows.
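
The general idea of weighted k nearest neighbor imputation for categorical genotypes can be sketched as follows. This is only an illustration, not the KNNcatImpute procedure itself: the abstract does not specify the distance, the weights, or how neighbors are chosen, so the mismatch distance, the inverse-distance weights, the choice of subjects as neighbors, and the names `mismatch_distance` and `impute_weighted_knn` are all assumptions made for this example.

```python
from collections import defaultdict


def mismatch_distance(a, b):
    """Fraction of differing genotypes over positions observed in both rows.

    Assumed distance for this sketch; None marks a missing genotype.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def impute_weighted_knn(data, k=3, eps=1e-6):
    """Return a copy of data with each None replaced by a weighted vote.

    data: list of rows (one per subject), each a list of genotype codes.
    The k nearest subjects with an observed value at the missing position
    vote for their genotype with weight 1 / (distance + eps).
    """
    imputed = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = [
                (mismatch_distance(row, other), other[j])
                for h, other in enumerate(data)
                if h != i and other[j] is not None
            ]
            candidates.sort(key=lambda c: c[0])
            votes = defaultdict(float)
            for dist, genotype in candidates[:k]:
                votes[genotype] += 1.0 / (dist + eps)
            if votes:  # leave the value missing if no neighbor is usable
                imputed[i][j] = max(votes, key=votes.get)
    return imputed
```

Computing distances between all pairs of subjects scales quadratically with the number of subjects, so a sketch like this is only suitable for small data sets.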

    A note on the simultaneous computation of thousands of Pearson’s Chi^2-statistics

    In genetic association studies, important and common goals are the identification of single nucleotide polymorphisms (SNPs) showing a distribution that differs between several groups and the detection of SNPs with a coherent pattern. In the former situation, tens of thousands of SNPs should be tested, whereas in the latter case typically several tens of SNPs are considered, leading to thousands of statistics that need to be computed. A test statistic appropriate for both goals is Pearson’s Chi^2-statistic. However, computing this (or another) statistic for each SNP or pair of SNPs separately is very time-consuming. In this article, we show how simple matrix computation can be employed to calculate the Chi^2-statistic for all SNPs simultaneously.
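
One way to see how matrix computation yields all statistics at once: if X is a 0/1 indicator matrix for one genotype level (subjects by SNPs) and Z is a 0/1 indicator matrix for group membership (subjects by groups), then the product of the transpose of X with Z holds the observed counts for every SNP at that level. The sketch below spells this out in plain Python. It is not taken from the article; the function name `chi2_all_snps`, the genotype coding 0/1/2, and the assumption of no missing values are choices made for this example.

```python
def chi2_all_snps(genotypes, groups, levels=(0, 1, 2)):
    """Pearson chi-square statistic of every SNP against the group labels.

    genotypes: list of rows (one per subject), each a list of SNP codes.
    groups: list of group labels, one per subject.
    Assumes complete data; cells with expected count 0 are skipped.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    classes = sorted(set(groups))
    # Columns of the group indicator matrix Z (one column per group).
    z_cols = [[1 if g == c else 0 for g in groups] for c in classes]
    group_totals = [sum(col) for col in z_cols]
    stats = [0.0] * m
    for level in levels:
        # Columns of the indicator matrix X for this genotype level.
        x_cols = [[1 if row[j] == level else 0 for row in genotypes]
                  for j in range(m)]
        # Observed counts for all SNPs at once: the matrix product X^T Z.
        observed = [[sum(x * z for x, z in zip(xc, zc)) for zc in z_cols]
                    for xc in x_cols]
        for j in range(m):
            level_total = sum(observed[j])
            for k, n_k in enumerate(group_totals):
                expected = level_total * n_k / n
                if expected > 0:
                    stats[j] += (observed[j][k] - expected) ** 2 / expected
    return stats
```

With a numerical library, the nested sums that build `observed` would become a single matrix multiplication, which is where the speed-up over a per-SNP loop would come from.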

    Comparison of the empirical bayes and the significance analysis of microarrays

    Microarrays make it possible to measure the expression levels of tens of thousands of genes simultaneously. One important statistical question in such experiments is which of the several thousand genes are differentially expressed. Answering this question requires methods that can deal with multiple testing problems. One such approach is the control of the False Discovery Rate (FDR). Two recently developed methods for the identification of differentially expressed genes and the estimation of the FDR are the SAM (Significance Analysis of Microarrays) procedure and an empirical Bayes approach. In the two group case, both methods are based on a modified version of the standard t-statistic. However, it is also possible to use the Wilcoxon rank sum statistic. While there already exists a version of the empirical Bayes approach based on this rank statistic, we introduce in this paper a new version of SAM based on Wilcoxon rank sums. We furthermore compare these four procedures by applying them to simulated and real gene expression data. Keywords: Identification of differentially expressed genes, Gene expression, Multiple Testing, False Discovery Rate
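
The two per-gene statistics mentioned above can be illustrated with a short sketch. The modified t-statistic is written in the usual SAM form, a mean difference divided by the pooled standard error plus a small positive constant s0. Here s0 is simply passed in as a parameter rather than estimated from the data, and the sign convention (first group minus second) and the function names are assumptions for this example. The rank sum uses midranks for ties.

```python
import math


def modified_t(x, y, s0=0.0):
    """Mean difference over (pooled standard error + s0) for one gene.

    x, y: expression values of the gene in the two groups.
    With s0 = 0 this reduces to the standard two-sample t-statistic
    and fails if both groups have zero variance.
    """
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


def rank_sum(x, y):
    """Wilcoxon rank sum of group x within the pooled sample (midranks)."""
    pooled = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based positions i+1 .. j; use their average.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[v] for v in x)
```

A positive s0 shrinks the statistic toward zero most strongly for genes with a small standard error, which keeps genes with very low variance from dominating the ranking.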

    Detecting high-order interactions of single nucleotide polymorphisms using genetic programming

    Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is additionally impeded by the fact that these interactions often are only explanatory for a relatively small subgroup of patients. Most of the feature selection methods proposed in the literature, unfortunately, fail at this task, since they can either only identify individual variables or interactions of a low order, or try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS (Genetic Programming for Association Studies), can be used not only for feature selection but also for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising several tens of SNPs, but can also be employed to analyze whole-genome data.

    Integrated analysis of copy number alterations and gene expression: a bivariate assessment of equally directed abnormalities

    Motivation: The analysis of a number of different genetic features like copy number (CN) variation, gene expression (GE) or loss of heterozygosity has considerably increased in recent years, as well as the number of available datasets. This is particularly due to the success of microarray technology. Thus, to understand mechanisms of disease pathogenesis on a molecular basis, e.g. in cancer research, the challenge of analyzing such different data types in an integrated way has become increasingly important. In order to tackle this problem, we propose a new procedure for an integrated analysis of two different data types that searches for genes and genetic regions which for both inputs display strong equally directed deviations from the reference median. We employ this approach, based on a modified correlation coefficient and an explorative Wilcoxon test, to find DNA regions of such abnormalities in GE and CN (e.g. underexpressed genes accompanied by a loss of DNA material). Results: In an application to acute myeloid leukemia, our procedure is able to identify various regions on different chromosomes with characteristic abnormalities in GE and CN data and shows a higher sensitivity to differences in abnormalities than standard approaches. While the results support various findings of previous studies, some new interesting DNA regions can be identified. In a simulation study, our procedure also shows more reliable results than standard approaches. Availability: Code and data available as R packages edira and ediraAMLdata from http://www.statistik.tu-dortmund.de/~schaefer/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.